Abstract
The internet is no doubt the biggest and the most important tool of modern civilisation. But along 
with its numerous benefits, it also comes with its own set of risks, the most important of them being 
breaches in security and privacy. 
An anomaly-based Intrusion Detection System (IDS) is a type of security system that is used to detect 
and alert on unusual or abnormal behaviour that may indicate an attack or intrusion. Unlike
signature-based IDS, which relies on known patterns of attack, anomaly-based IDS is designed to detect
previously unseen or unknown attacks by identifying deviations from normal patterns of behaviour. 
Multiple linear regression is a statistical technique used to analyse the relationship between a 
dependent variable and multiple independent variables. In this technique, a linear equation is 
established between the dependent variable and multiple independent variables, with the aim of 
predicting the value of the dependent variable for a given set of values of the independent variables.
In this paper, we collected a data set of 125974 entries and 42 attributes from Kaggle, pre-processed
the data and used logistic regression to predict the dependent variable (called xAttack) using 25
independent variables, as we found a high correlation between the aforementioned variables.
The results are evaluated using 10-fold cross validation and several train-test splits of the data set:
80-20, 66-34, and 50-50. After testing the data set with these train-test splits, an accuracy of
92.73% was achieved.
Keywords: Intrusion Detection System (IDS), Machine Learning, Multiple Linear regression, security 
breach. 


1. Introduction 

The internet has become an essential tool in modern society. A huge amount of essential and
confidential data is present on the internet. This data might be extremely important for the security of
the host, but data on the internet is always at risk of infringement.

As a result of the recent COVID-19 pandemic, many employees were encouraged to work from
home. This has led to a massive surge in the transmission of sensitive data online, requiring
employers to provide a safe working environment. Therefore, we need a means of security that
protects us against possible cyber-attacks.

An Intrusion Detection System (IDS) is a security technology designed to detect and prevent 
unauthorized access or malicious activity on a computer network or system. Its primary purpose is to 
identify and respond to potential security breaches and attacks, alerting security personnel or 
automated response systems to take action. 

In this work, detection is performed using Multiple Linear Regression (MLR). Multiple linear regression is a
statistical method used to analyze the relationship between multiple independent variables and a
dependent variable. It is a form of linear regression in which the dependent variable is modelled as a linear
combination of multiple independent variables.


2. Relevant Literature 

Today, security has become a critical concern for individuals, businesses, and governments alike. 
With the increasing reliance on technology and the internet, the risk of cyber-attacks, data breaches, 
and other forms of digital threats has also risen. 
An Intrusion Detection System (IDS) [6] addresses the majority of these problems. The main purpose of an Intrusion
Detection System is to identify potential security breaches as early as possible so that appropriate
action can be taken to prevent or mitigate the damage. The IDS alerts the system and security
administrators to malicious or anomalous activity such as attempts to access restricted resources,
modifications to system files, or unusual network traffic patterns.

Ali H. Mirza (2018) [1] used logistic regression, neural networks, and decision trees for intrusion 
detection and reduced dataset dimensions using PCA. Ensemble learning was implemented by 
assigning weights to each classifier, and the results were used to determine if a sample was anomalous 
or not. 

T. Saranyaa et al. (2020) [2] explored different machine
learning algorithms used for intrusion detection systems in various environments. The study shows
that the detection rate, false positive rate, and accuracy of the algorithms used in IDS depend not only
on the algorithm but also on the specific application area. As future work, the authors sought to conduct an extensive
study of ML algorithms to provide a better solution for IDS using a real-time dataset.

However, Partha Ghosh et al. (2015) [3] took a different approach, using a modified genetic algorithm
(GA) procedure with probabilistic selection, selective mutation, and a fitness score based on mutual
correlation, to reduce storage and processing time without compromising accuracy. The proposed
GA-BFSS method produced a better feature set than the ordinary GA method, and with the reduced
set of features, a one-vs-all (OVA) classifier based on logistic regression (LR) was designed for multiclass classification.

Christiana Ioannou et al. (2017) [4] implemented a model for detecting Selective Forward and
Blackhole attacks, which was evaluated in various network topologies using the Contiki O/S. The results
showed a promising accuracy of 91% in detecting both types of attacks simultaneously using the
Binary Logistic Regression (BLR) model.

In their study, Anil Lamba et al. (2015) [5] compared four supervised machine learning classifiers for 
intrusion detection using the NSL-KDD dataset. The classifiers tested were Support Vector Machine, 
Random Forest, Logistic Regression, and Gaussian Naive Bayes. Random Forest was found to be the 
most effective with an accuracy of 99%. As future work, they sought to
consider multiclass classification and to focus on the attributes most important for intrusion detection.

G. Vandewiele et al. (2020) [7] discussed the proper use of oversampling and undersampling on a
highly imbalanced dataset in order to generate optimistic results. The method is followed so
that the majority class is not overly represented; if it were, poor generalization on the
minority class would occur. It also ensures that the training sets have a similar number of
samples.


3. Methodology 

Data: 

In this paper, the data we have been working on has been taken from the Kaggle website. The database
is based on the networking information of an unnamed university for the years 2013 and 2014; the
reference can be found through the following link:
https://www.kaggle.com/datasets/what0919/intrusion-detection

This database contains information on various details such as protocol type, number of services, and type
of attack on the client.


Table 1: Dataset Overview 


ATTRIBUTE | DEFINITION | VALUE
service | The number of services being run at that particular time | 34.292
flag | Denotes the status of the operation | -
[hot] | A system or network component that is currently experiencing a high level of activity or usage | -
[count] | Counts the number of domain visits | 60.91
[serror_rate] | The rate at which a network device or application generates "source errors" | 0.194
srv_serror_rate | The percentage of TCP connections that were unsuccessful due to a "syn" packet error or other TCP-related errors at the server end | 0.185
rerror_rate | The rate at which a remote system or network returns error messages in response to connection attempts | 0.132
srv_rerror_rate | The rate at which a remote service returns error messages in response to requests made to that service | -
[same_srv_rate] | Proportion of requests made to the same service on a target host | 0.738
diff_srv_rate | The percentage of packets that were received on a server with a different service than the one expected | 0.0791
srv_diff_host_rate | Percentage of packets that were received on a server from a different host than the one expected | 0.0824
dst_host_count | Total number of connections that have been made to a particular destination host within a given period of time | 139.48
dst_host_srv_count | A metric that measures the number of distinct services that are available on a destination host | -
dst_host_same_srv_rate | Percentage of connections made to a destination host that use the same service as the previous connection to that host | -
dst_host_serror_rate | The rate at which a destination host returns "destination unreachable" errors in response to connection attempts from a source host | -
dst_host_rerror_rate | Percentage of connections that have received an error response from a particular destination host, out of all connection attempts made to that host | 0.131
dst_host_diff_srv_rate | Proportion of connections made to the same service on a destination host | 0.105
dst_host_srv_serror_rate | The rate at which a destination host returns "destination unreachable" errors in response to connection attempts to a specific service on that host | -
dst_host_srv_rerror_rate | Percentage of connections that have received an error from a particular destination host, out of all connection attempts made to a specific service on that host | 0.1346
dst_host_same_src_port_rate | Percentage of connections that have received an error response from a particular destination host, out of all connection attempts made to a specific service on that host | 0.376
dst_host_srv_diff_host_rate | The percentage of connections made to a specific service on a destination host that come from a different host than the previous connection to that service | -
[protocol_type] | Transmit data between devices on a network | 0.499
[root_shell] | Used by system administrators or advanced users to perform maintenance and troubleshooting tasks | 0.096
[logged_in] | Checks if the client is logged in | 0.066
[xAttack] | The type of threat on the client, if any | -


Architectural Design: 
Research Method:
As mentioned earlier, we have used Multiple Linear Regression (MLR), a statistical technique for
regression analysis. First, we identify the independent variables that influence the
dependent variable using their correlation with one another. This gives 25 independent
variables, namely ‘service’, ‘flag’, ‘hot’, ‘count’, ‘serror_rate’, ‘srv_serror_rate’, ‘rerror_rate’,
‘srv_rerror_rate’, ‘same_srv_rate’, ‘diff_srv_rate’, ‘srv_diff_host_rate’, ‘dst_host_count’,
‘dst_host_srv_count’, ‘dst_host_same_srv_rate’, ‘dst_host_diff_srv_rate’,
‘dst_host_same_src_port_rate’, ‘dst_host_srv_diff_host_rate’, ‘dst_host_serror_rate’,
‘dst_host_srv_serror_rate’, ‘dst_host_rerror_rate’, ‘dst_host_srv_rerror_rate’, ‘protocol_type’,
‘logged_in’, ‘root_shell’, ‘is_guest_login’, and one dependent variable, ‘xAttack’.
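The correlation-based selection described above can be sketched as follows. This is only an illustrative sketch: the paper does not state which correlation measure or cut-off was used, so the Pearson coefficient, the `threshold` parameter, the function names, and the toy values are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, target, threshold=0.1):
    """Keep the columns whose |correlation| with the target meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Toy data (hypothetical values, not taken from the Kaggle dataset).
columns = {
    "count": [1, 2, 3, 4, 5],
    "noise": [3, 3, 3, 3, 3],
    "serror_rate": [0.9, 0.7, 0.5, 0.3, 0.1],
}
target = [0, 1, 2, 3, 4]
selected = select_features(columns, target, threshold=0.5)
```

Both strongly positive and strongly negative correlations are kept, since either can help predict the dependent variable.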

We performed the following steps in MLR to produce our final model:

STEP 1: Understanding the data: 

The first step in building the model is to identify the dependent and independent variables. After that
we try to develop a logistic relation between the dependent and independent variables. We then split
the data in three ways, using 4/5, 2/3, and 1/2 of it as training data and the rest as testing data.

Cross validation, also known as out-of-sample testing, is a method where various parts of the data are used
to train the model and calculate its accuracy in practice. Here we divided the dataset into 10 parts;
each time, we select one part out of the 10 as the testing data and the remaining nine parts as training data.

A confusion matrix, also known as an error matrix, shows the performance of the algorithm in the form
of a table. It is computed on a set of test data for which the true values are known.

                      Predicted class
                      P                       N
Actual class    P     True Positives (TP)     False Negatives (FN)
                N     False Positives (FP)    True Negatives (TN)
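The 10-fold procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `k_fold_indices` is hypothetical, and the sketch does not shuffle the data, which would normally be done before partitioning.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Example with 25 samples (a hypothetical size chosen for readability).
folds = list(k_fold_indices(25, k=10))
```

Every sample appears in exactly one test fold, so the ten accuracy values together cover the whole dataset.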


Now let’s take, 

TP= TRUE POSITIVE 
TN= TRUE NEGATIVE 
FP= FALSE POSITIVE 
FN= FALSE NEGATIVE 


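Now,

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1\_Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

$$\text{Kappa}(\kappa) = \frac{\text{accuracy} - \text{random accuracy}}{1 - \text{random accuracy}}$$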

So, to find random accuracy [11]:

We know from the confusion matrix that a randomly drawn label from the dataset would be positive
with probability $P_1$ and negative with probability $(1 - P_1)$, where

$$P_1 = \frac{TP + FN}{TP + TN + FP + FN}$$

We also know that our classifier produces a positive label with probability $P_2$ and a negative label
with probability $(1 - P_2)$, where

$$P_2 = \frac{TP + FP}{TP + TN + FP + FN}$$

Random accuracy is just the probability that the labels produced by these two processes coincide by
chance (assuming independence):

$$\text{random accuracy} = P_1 P_2 + (1 - P_1)(1 - P_2)$$
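As a worked illustration of the kappa calculation, the sketch below assumes the standard definitions: P1 is the fraction of actual positives, (TP + FN) / total, and P2 is the fraction of predicted positives, (TP + FP) / total. The function name and the example counts are hypothetical.

```python
def cohen_kappa_binary(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    p1 = (tp + fn) / total  # probability that a true label is positive
    p2 = (tp + fp) / total  # probability that a predicted label is positive
    random_accuracy = p1 * p2 + (1 - p1) * (1 - p2)
    return (accuracy - random_accuracy) / (1 - random_accuracy)

# Hypothetical counts: accuracy 0.85, random accuracy 0.5.
kappa = cohen_kappa_binary(tp=40, fn=10, fp=5, tn=45)
```

The function is undefined when random accuracy equals 1, which happens only if both the true and the predicted labels all fall in a single class.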

Accuracy is the proportion of all predictions that are correct, i.e. how closely the predicted
labels reflect the true ones.

Specificity (the true negative rate) is the proportion of actual negative instances that the model
correctly identifies as negative, i.e. the probability of a negative result given that the condition is absent.

Precision is the proportion of instances predicted as positive that are actually positive. It
indicates how far a positive prediction can be trusted.

Recall (also known as sensitivity or true positive rate) is a performance metric that measures the 
proportion of actual positive instances that were correctly identified by a model out of all positive 
instances. 

F1 score is a measure of a classification model's accuracy. It is the harmonic mean of precision
and recall, two metrics that are commonly used to evaluate the performance of a classifier.
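The metrics defined above can be computed directly from the four confusion-matrix counts. The following is a small sketch with hypothetical counts; the function name `binary_metrics` is an assumption for illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, specificity, precision, recall and F1 from binary counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts, not taken from the paper's results.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
```

With these counts the F1 score (about 0.842) lies between recall (0.8) and precision (about 0.889), closer to the smaller of the two, as expected of a harmonic mean.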

Cohen's Kappa score [10] is a numerical measure used to evaluate the performance of a machine
learning classification model, particularly when the classes are imbalanced. It ranges from -1 to 1,
with 1 representing perfect agreement and 0 representing agreement no better than chance.
Negative values indicate less agreement than expected by chance.

Oversampling is a technique used in statistical analysis and machine learning to handle imbalanced
datasets where the number of observations in one class is significantly lower than in the other. In this
technique, the minority class is artificially enlarged by adding copies of its observations until it
reaches a similar or proportional size to the majority class.

Undersampling is a technique used in statistical analysis and machine learning to handle imbalanced
datasets where the number of observations in one class is significantly higher than in the other. In this
technique, the majority class is reduced by randomly removing observations until it reaches a similar
or proportional size to the minority class.
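Both resampling techniques can be sketched in a few lines. This is an illustrative sketch under simple assumptions (random duplication and random removal until all classes are the same size); the function names, the `seed` parameter, and the toy data are hypothetical and not taken from the paper.

```python
import random
from collections import Counter

def group_by_class(rows, labels):
    """Map each class label to the list of rows that carry it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups

def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until each matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    out = []
    for label, members in groups.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out.extend((row, label) for row in members + extra)
    return [r for r, _ in out], [l for _, l in out]

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows of larger classes until each matches the smallest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    out = []
    for label, members in groups.items():
        out.extend((row, label) for row in rng.sample(members, target))
    return [r for r, _ in out], [l for _, l in out]

# Toy data: 7 "normal" rows and 3 "attack" rows (hypothetical).
rows = list(range(10))
labels = ["normal"] * 7 + ["attack"] * 3
_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
over_counts = Counter(over_labels)
under_counts = Counter(under_labels)
```

Oversampling keeps all original rows and grows the dataset, while undersampling discards information from the larger class in exchange for a smaller, balanced set.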


STEP 2: Selecting the suitable method: 

The model is made using MLR (Multiple Linear Regression). For this method we first search for
correlation between the dependent and independent variables, then we split the data into different
fractions such as 80-20, 66-34, and 50-50, followed by calculating the confusion matrix.
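The three splits can be produced with one small helper. The sketch below is an assumption about how such a split might be done (a seeded shuffle followed by a cut); the function name, the `seed` parameter, and the 100-row toy dataset are hypothetical.

```python
import random

def train_test_split(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# 100 placeholder rows split 80-20, 66-34 and 50-50.
rows = list(range(100))
splits = {f: train_test_split(rows, f) for f in (0.80, 0.66, 0.50)}
```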


STEP 3: Developing equation of MLR and Confusion Matrix: 

a. The logit (logistic) regression model 

The multinomial logistic regression [8] is a generalization of the binary model. In general, a logistic
regression model is used to find the probability of a class, such as yes or no, based on the
observations in a dataset.

It can be defined as a classification problem, where the output or target variable (y) is dependent on 
the given values or inputs (X) in a dataset. 

For a response variable Y with two measurement levels (dichotomous) and explanatory variable X,
let $\pi(x) = p(Y = 1 \mid X = x) = 1 - p(Y = 0 \mid X = x)$. The logistic regression model has a linear form
for the logit of this probability:

$$\text{logit}[\pi(x)] = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta x$$

where the odds are $\pi(x) / (1 - \pi(x))$. The odds equal $\exp(\alpha + \beta x)$, and the logarithm of the odds is
called the logit, so

$$\text{logit}[\pi(x)] = \log[\exp(\alpha + \beta x)] = \alpha + \beta x$$

The logit is the natural logarithm of the odds. The parameter $\beta$ controls the rate of increase or
decrease of the S-shaped curve for $\pi(x)$: the curve ascends if $\beta > 0$ and descends if $\beta < 0$.
Figure 3: S-Curve of Logistic Regression (plot of sigmoid(X))
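The relationship between the logit and the S-shaped curve can be checked numerically. In the sketch below, `alpha` and `beta` stand for the intercept and slope of the linear form; their values and the helper names are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds of a probability p, the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def pi_of_x(x, alpha, beta):
    """Probability that Y = 1 given x under the logistic model."""
    return sigmoid(alpha + beta * x)

alpha, beta = -1.0, 2.0  # hypothetical parameters with beta > 0
xs = [-2, -1, 0, 1, 2]
rising = [pi_of_x(x, alpha, beta) for x in xs]
falling = [pi_of_x(x, alpha, -beta) for x in xs]
```

Applying `logit` to any value in `rising` recovers `alpha + beta * x`, which is the linear form of the model; flipping the sign of `beta` turns the ascending curve into a descending one.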


b. Multiple Linear regression: 

Multilinear regression [12] is a statistical modeling technique used to examine the linear relationship 
between a dependent variable and multiple independent variables. In other words, it is a method of 
estimating the values of a dependent variable based on the values of multiple independent variables. 
The equation for multilinear regression can be expressed as: 

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + \varepsilon$$

Here, Y is the dependent variable, $X_1, X_2, \dots, X_n$ are the independent variables, $b_0$ is the
intercept or constant, $b_1, b_2, \dots, b_n$ are the regression coefficients, and $\varepsilon$ is the residual error. The goal of
multilinear regression is to determine the coefficients $b_0, b_1, b_2, \dots, b_n$ that best fit the data to the model,
such that the sum of the squared errors is minimized. This is typically done using an estimation
method such as ordinary least squares.
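To make the least-squares step concrete, the sketch below solves the normal equations for a tiny noise-free example. It is an illustration only: the function name `fit_ols` and the toy data are hypothetical, and in practice a numerical library would be used instead of hand-written elimination.

```python
def fit_ols(X, y):
    """Return [b0, b1, ..., bn] minimising the sum of squared errors."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    m = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] for i in range(m)]

# Noise-free toy data generated from Y = 1 + 2*X1 + 3*X2.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
coeffs = fit_ols(X, y)
```

Because the toy data contain no noise, the fitted coefficients recover the intercept and slopes used to generate them (up to floating-point error). The sketch assumes the predictors are not perfectly collinear; otherwise the pivot becomes zero.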


Consider an example [9] in which we want to know whether a person is expected to have a heart
attack, depending on his/her blood pressure, weight, and age. The outcome is a binomial
nominal variable, i.e., heart attack vs. no heart attack. The basic goal of Multiple Logistic Regression
is to understand the functional relationship between the dependent and independent variables, and
what causes the probability of the outcome to change.

Logistic regression can be extended to models with multiple explanatory variables. Let k denote the
number of predictors $X_1, X_2, \dots, X_k$ for a binary response Y. The model for the log odds is

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

and the alternative formula, directly specifying $\pi(x)$, is

$$\pi(x) = \frac{\exp(\alpha + \beta_1 x_1 + \dots + \beta_k x_k)}{1 + \exp(\alpha + \beta_1 x_1 + \dots + \beta_k x_k)}$$

Here $\beta_i$ refers to the effect of $X_i$ on the log odds that Y = 1, controlling for the other $X_j$.

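If one has n independent observations with p explanatory variables, then to construct the logits, one
of the categories must be considered as the base and the rest relative to it. Due to the lack of ordering,
any category may be used as the base category k. Let $\pi_j$ denote the multinomial probability of an observation falling in the
j-th class. To find the relationship between this probability and the p explanatory variables $X_1, X_2, \dots, X_p$,
the multiple logistic regression model is then

$$\log\left(\frac{\pi_j(x_i)}{\pi_k(x_i)}\right) = \alpha_{0j} + \beta_{1j} x_{1i} + \beta_{2j} x_{2i} + \dots + \beta_{pj} x_{pi}$$

where j = 1, 2, ..., (k-1) and i = 1, 2, ..., n. Since all the $\pi$'s add to unity, this reduces to

$$\pi_j(x_i) = \frac{\exp(\alpha_{0j} + \beta_{1j} x_{1i} + \dots + \beta_{pj} x_{pi})}{1 + \sum_{l=1}^{k-1} \exp(\alpha_{0l} + \beta_{1l} x_{1i} + \dots + \beta_{pl} x_{pi})}$$

for j = 1, 2, ..., (k-1). The model parameters are estimated by the method of maximum likelihood (ML).
In practice, we use statistical software to do this fitting.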
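The baseline-category construction can be illustrated with a short sketch. It assumes the usual form in which each of the first k-1 categories has its own intercept and coefficients and the last category is the base; the function name and all coefficient values are hypothetical and are not fitted to the paper's data.

```python
import math

def multinomial_probs(x, coefs):
    """Class probabilities for one observation x.

    coefs holds (intercept, [betas]) for categories 1..k-1;
    category k is the base and has no coefficients of its own.
    """
    scores = [alpha + sum(b * xi for b, xi in zip(betas, x))
              for alpha, betas in coefs]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)  # probability of the base category
    return probs

# Three categories, two explanatory variables (hypothetical coefficients).
coefs = [(0.5, [1.0, -1.0]), (-0.2, [0.3, 0.8])]
x = [1.0, 2.0]
probs = multinomial_probs(x, coefs)
```

The probabilities sum to one, and the log of each probability relative to the base category equals that category's linear score, which is the defining property of the model.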


In this model, the hypotheses used are:

H0: None of the controlled variables X1, X2 and X3 is significantly related to Y

H1: At least one of the controlled variables X1, X2 and X3 is significantly related to Y

The model of Multiple Logistic Regression can be represented as:

$$y = a + b_1 x_1 + b_2 x_2 + \dots + b_i x_i$$

where i = 1, 2, 3, ..., n


Where,

y = xAttack = Shows the type of threat on the client, if any
a = Constant (intercept)
b1 = Coefficient of first controlled variable
b2 = Coefficient of second controlled variable
b3 = Coefficient of third controlled variable
b4 = Coefficient of fourth controlled variable and so on

x1 = service
x2 = flag
x3 = count
x4 = serror_rate
x5 = same_srv_rate
x6 = dst_host_srv_count
x7 = dst_host_same_srv_rate
x8 = dst_host_serror_rate
x9 = dst_host_srv_serror_rate
x10 = protocol_type
x11 = logged_in
x12 = hot
x13 = srv_serror_rate
x14 = rerror_rate
x15 = srv_rerror_rate
x16 = diff_srv_rate
x17 = srv_diff_host_rate
x18 = dst_host_count
x19 = dst_host_diff_srv_rate
x20 = dst_host_same_src_port_rate
x21 = dst_host_srv_diff_host_rate
x22 = dst_host_rerror_rate
x23 = dst_host_srv_rerror_rate
x24 = root_shell
x25 = is_guest_login

In the case of b1, $\bar{x}$ is the mean of service. In the case of b2, $\bar{x}$ is the mean of flag. In the case of b3, $\bar{x}$
is the mean of count. In the case of b4, $\bar{x}$ is the mean of serror_rate.

In every case of b, $\bar{y}$ is the mean of xAttack.

4. Results and Discussion 

After analysing this model, we obtain the results given below. All values are
percentages except for kappa, which here lies between 0 and 1.


For 80 - 20% train-test split:

Confusion Matrix:

Actual \ Predicted |     0 |     1 |     2 |     3 |     4
0                  |  4436 |    91 |   177 |   244 |   157
1                  |   170 |  5539 |    15 |    46 |    18
2                  |    94 |     7 |  4736 |     3 |    65
3                  |   122 |    49 |    12 |  4167 |    21
4                  |    99 |     0 |   402 |     0 |  4050

Calculated Results:

Class         | Precision | Recall | F1-score | Specificity | Kappa | Accuracy
0             | 90        | 87     | 88       | 97.5        |       |
1             | 97        | 96     | 97       | 99.2        |       |
2             | 89        | 97     | 92       | 96.9        |       |
3             | 93        | 95     | 94       | 98.5        |       |
4             | 94        | 89     | 91       | 98.7        |       |
Macro avg.    | 93        | 93     | 93       | 98.16       | 0.90  | 92.7
Weighted avg. | 93        | 93     | 93       | 98          |       |
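As a consistency check, the per-class figures for the 80-20 split can be recomputed from the confusion matrix. The sketch below assumes that rows are actual classes and columns are predicted classes, and treats each class one-vs-rest; the function names are hypothetical.

```python
# Confusion matrix for the 80-20 split (rows: actual, columns: predicted).
matrix = [
    [4436, 91, 177, 244, 157],
    [170, 5539, 15, 46, 18],
    [94, 7, 4736, 3, 65],
    [122, 49, 12, 4167, 21],
    [99, 0, 402, 0, 4050],
]

def per_class_metrics(cm):
    """Precision, recall, F1 and specificity for each class (one-vs-rest)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # actual c, predicted other
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, actual other
        tn = total - tp - fn - fp
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        results.append({
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / (precision + recall),
            "specificity": tn / (tn + fp),
        })
    return results

def accuracy_and_kappa(cm):
    """Overall accuracy and Cohen's kappa for a multiclass confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of P(actual = c) * P(predicted = c).
    random_accuracy = sum(
        sum(cm[c]) * sum(cm[r][c] for r in range(n)) for c in range(n)
    ) / total ** 2
    return accuracy, (accuracy - random_accuracy) / (1 - random_accuracy)

class_metrics = per_class_metrics(matrix)
accuracy, kappa = accuracy_and_kappa(matrix)
```

Under these assumptions the recomputed precision and recall round to the tabulated per-class values, the accuracy comes out at roughly 92.75%, and kappa at roughly 0.91, in line with the reported 92.7 and 0.90.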

For 66 - 34% train-test split:

Confusion Matrix:

Actual \ Predicted |     0 |     1 |     2 |     3 |     4
0                  |  7533 |   128 |   310 |   424 |   245
1                  |   301 |  9415 |    27 |    78 |    37
2                  |   174 |     9 |  7966 |     7 |   122
3                  |   194 |    61 |    16 |  7047 |    34
4                  |   156 |     0 |   889 |     0 |  6850

Calculated Results:

Class         | Precision | Recall | F1-score | Specificity | Kappa | Accuracy
0             | 90        | 87     | 89       | 97.5        |       |
1             | 98        | 96     | 97       | 99.3        |       |
2             | 87        | 96     | 91       | 96.3        |       |
3             | 93        | 96     | 95       | 98.5        |       |
4             | 94        | 87     | 90       | 98.7        |       |
Macro avg.    | 92        | 92     | 92       | 98.06       | 0.90  | 92.4
Weighted avg. | 93        | 92     | 92       | 98          |       |

For 50 - 50% train-test split:

Confusion Matrix:

Actual \ Predicted |     0 |     1 |     2 |     3 |     4
0                  | 10996 |   205 |   447 |   630 |   373
1                  |   439 | 13843 |    42 |   114 |    53
2                  |   276 |    13 | 11825 |    10 |   182
3                  |   280 |    86 |    25 | 10289 |    52
4                  |   234 |     0 |  1323 |     0 | 10061

Calculated Results:

Class         | Precision | Recall | F1-score | Specificity | Kappa | Accuracy
0             | 90        | 87     | 88       | 97.4        |       |
1             | 98        | 96     | 97       | 99.3        |       |
2             | 87        | 96     | 91       | 96.2        |       |
3             | 93        | 96     | 95       | 98.5        |       |
4             | 94        | 87     | 90       | 98.6        |       |
Macro avg.    | 92        | 92     | 92       | 98          | 0.90  | 92.3
Weighted avg. | 92        | 92     | 92       | 98          |       |

For 10-fold cross-validation:

Test Case | Accuracy
01        | 92.45
02        | 92.66
03        | 92.55
04        | 92.51
05        | 92.88
06        | 92.77
07        | 92.87
08        | 92.75
09        | 92.55
10        | 92.63
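The ten fold accuracies can be summarised by their mean and spread. The short sketch below uses only the values listed above; the variable names are chosen for illustration.

```python
from statistics import mean, stdev

# Accuracy (in percent) of each of the ten cross-validation folds.
fold_accuracies = [92.45, 92.66, 92.55, 92.51, 92.88,
                   92.77, 92.87, 92.75, 92.55, 92.63]

mean_accuracy = mean(fold_accuracies)
spread = stdev(fold_accuracies)  # sample standard deviation across folds
lowest, highest = min(fold_accuracies), max(fold_accuracies)
```

The folds average about 92.66% and all lie between 92.45% and 92.88%, so the model's accuracy varies by well under half a percentage point across folds.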


5. Conclusion 

The present study employs the Multiple Linear Regression (MLR) statistical technique to construct
an Intrusion Detection System (IDS). To achieve this, the data has been divided
in three distinct ways using the train-test-split approach. Subsequently,
a 10-fold cross-validation methodology has been implemented, and the confusion matrix has been
developed to evaluate the efficacy of the predictor.

The recorded accuracies for the 4/5, 2/3, and 1/2 train-test splits are 92.7, 92.4 and 92.3 respectively.
Our work encompasses the application of the IDS which can be beneficial in various industries such 
as finance, healthcare, and government sectors where data privacy and security are of utmost 
importance. By using this system, organizations can enhance their security measures, reduce the risk 
of data breaches, and ultimately protect their reputation and customer trust. Moreover, the IDS can 
also aid in the forensic investigation of cyber-attacks by providing real-time alerts and precise 
information about the intrusion attempts. Overall, the IDS based on logistic regression has
promising potential to improve the cybersecurity landscape and prevent threats before they cause any
significant harm. 